AITopics | malicious instruction

Collaborating Authors

malicious instruction

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

MoGU: A Framework for Enhancing Safety of LLMs While Preserving Their Usability

Neural Information Processing SystemsMar-21-2026, 20:13:14 GMT

Large Language Models (LLMs) are increasingly deployed in various applications. As their usage grows, concerns regarding their safety are rising, especially in maintaining harmless responses when faced with malicious instructions. Many defense strategies have been developed to enhance the safety of LLMs. However, our research finds that existing defense strategies lead LLMs to predominantly adopt a rejection-oriented stance, thereby diminishing the usability of their responses to benign instructions. To solve this problem, we introduce the MoGU framework, designed to enhance LLMs' safety while preserving their usability. Our MoGU framework transforms the base LLM into two variants: the usable LLM and the safe LLM, and further employs dynamic routing to balance their contribution. When encountering malicious instructions, the router will assign a higher weight to the safe LLM to ensure that responses are harmless.

artificial intelligence, large language model, natural language, (15 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Uncovering Safety Risks of Large Language Models through Concept Activation Vector

Neural Information Processing SystemsFeb-18-2026, 06:42:00 GMT

Warning: This paper contains text examples that are offensive or harmful in nature.

large language model, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country:

North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)
(4 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)
Government (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.96)

Add feedback

MoGU: A Framework for Enhancing Safety of LLMs While Preserving Their Usability

Neural Information Processing SystemsFeb-17-2026, 01:03:56 GMT

large language model, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country:

Asia > Middle East > Republic of Türkiye (0.04)
Asia > China > Heilongjiang Province > Harbin (0.04)

Genre: Research Report > Experimental Study (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Government (1.00)
Banking & Finance (0.67)
Law (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

Add feedback

Mitigating Indirect Prompt Injection via Instruction-Following Intent Analysis

Kang, Mintong, Xiang, Chong, Kariyappa, Sanjay, Xiao, Chaowei, Li, Bo, Suh, Edward

arXiv.org Artificial IntelligenceDec-2-2025

Indirect prompt injection attacks (IPIAs), where large language models (LLMs) follow malicious instructions hidden in input data, pose a critical threat to LLMpowered agents. In this paper, we present IntentGuard, a general defense framework based on instruction-following intent analysis. The key insight of Intent-Guard is that the decisive factor in IPIAs is not the presence of malicious text, but whether the LLM intends to follow instructions from untrusted data. Building on this insight, IntentGuard leverages an instruction-following intent analyzer (IIA) to identify which parts of the input prompt the model recognizes as actionable instructions, and then flag or neutralize any overlaps with untrusted data segments. To instantiate the framework, we develop an IIA that uses three "thinking intervention" strategies to elicit a structured list of intended instructions from reasoning-enabled LLMs. These techniques include start-of-thinking prefilling, end-of-thinking refinement, and adversarial in-context demonstration. We evaluate IntentGuard on two agentic benchmarks (AgentDojo and Mind2Web) using two reasoning-enabled LLMs (Qwen-3-32B and gpt-oss-20B). Results demonstrate that IntentGuard achieves (1) no utility degradation in all but one setting and (2) strong robustness against adaptive prompt injection attacks (e.g., reducing attack success rates from 100% to 8.5% in a Mind2Web scenario). Indirect prompt injection attacks (IPIAs) (Greshake et al., 2023), where large language models (LLMs) follow malicious instructions hidden in the input data, have emerged as a top security concern for LLM-powered agents. Although many defenses have been proposed, each faces fundamental limitations. Finetuning-based defenses (Chen et al., 2024; 2025b) are costly and lack interpretability; auxiliary classifiers for IPIA detection Shi et al. (2025); Hung et al. (2024) often fail to generalize and are vulnerable to adaptive attacks; system-level rule enforcement Debenedetti et al. (2025) can impact agent utility while offering little robustness against attacks that do not alter control and data flows (e.g., injecting misinformation or phishing links into an email summary). In this paper, we approach the prompt injection problem from a new perspective: instruction-following intent analysis. For an LLM to effectively follow instructions, it must have an internal mechanism to decide which parts of a prompt it recognizes as actionable instructions.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2512.00966

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.66)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

Add feedback

The Shawshank Redemption of Embodied AI: Understanding and Benchmarking Indirect Environmental Jailbreaks

Li, Chunyang, Kang, Zifeng, Zhang, Junwei, Ma, Zhuo, Cheng, Anda, Li, Xinghua, Ma, Jianfeng

arXiv.org Artificial IntelligenceNov-21-2025

The adoption of Vision-Language Models (VLMs) in embodied AI agents, while being effective, brings safety concerns such as jailbreaking. Prior work have explored the possibility of directly jailbreaking the embodied agents through elaborated multi-modal prompts. However, no prior work has studied or even reported indirect jailbreaks in embodied AI, where a black-box attacker induces a jailbreak without issuing direct prompts to the embodied agent. In this paper, we propose, for the first time, indirect environmental jailbreak (IEJ), a novel attack to jailbreak embodied AI via indirect prompt injected into the environment, such as malicious instructions written on a wall. Our key insight is that embodied AI does not ''think twice'' about the instructions provided by the environment -- a blind trust that attackers can exploit to jailbreak the embodied agent. We further design and implement open-source prototypes of two fully-automated frameworks: SHAWSHANK, the first automatic attack generation framework for the proposed attack IEJ; and SHAWSHANK-FORGE, the first automatic benchmark generation framework for IEJ. Then, using SHAWSHANK-FORGE, we automatically construct SHAWSHANK-BENCH, the first benchmark for indirectly jailbreaking embodied agents. Together, our two frameworks and one benchmark answer the questions of what content can be used for malicious IEJ instructions, where they should be placed, and how IEJ can be systematically evaluated. Evaluation results show that SHAWSHANK outperforms eleven existing methods across 3,957 task-scene combinations and compromises all six tested VLMs. Furthermore, current defenses only partially mitigate our attack, and we have responsibly disclosed our findings to all affected VLM vendors.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2511.16347

Country: Asia > China (0.46)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

Amazon Explains How Its AWS Outage Took Down the Web

WIREDOct-25-2025, 10:30:00 GMT

Plus: The Jaguar Land Rover hack sets an expensive new record, OpenAI's new Atlas browser raises security fears, Starlink cuts off scam compounds, and more. The cloud giant Amazon Web Services experienced DNS resolution issues on Monday leading to cascading outages that took down wide swaths of the web . Monday's meltdown illustrated the world's fundamental reliance on so-called hyperscalers like AWS and the challenges for major cloud providers and their customers alike when things go awry . See below for more about how the outage occurred. US Justice Department indictments in a mob-fueled gambling scam reverberated through the NBA on Thursday.

aw outage took, browser, compound, (12 more...)

WIRED

Country:

Asia > Middle East > Palestine (0.14)
Asia > Myanmar (0.06)
North America > United States > California > San Francisco County > San Francisco (0.05)
(9 more...)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Law (1.00)
Information Technology > Services (1.00)
(2 more...)

Technology:

Information Technology > Communications > Web (0.91)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.53)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.52)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.38)

Add feedback

In-Browser LLM-Guided Fuzzing for Real-Time Prompt Injection Testing in Agentic AI Browsers

Cohen, Avihay

arXiv.org Artificial IntelligenceOct-16-2025

AI-powered browser assistants (also known as autonomous browsing agents or agentic AI browsers) are emerging tools that use LLMs to help users navigate and interact with web content. For example, an AI agent can be instructed to summarize a webpage or perform actions like clicking links and filling forms on behalf of the user. While these agents promise enhanced productivity, they also introduce new security risks. One major risk is prompt injection, where an attacker embeds malicious instructions into web content that the agent will process [5]. Crucially, such instructions can be hidden from the human user (e.g., invisible text, HTML comments) yet still parsed by the LLM, causing it to alter its behavior in unintended ways [10]. In effect, the agent can be tricked into executing the attacker's commands rather than the user's, leading to potentially severe consequences [2]. Indirect prompt injections have been demonstrated in real-world scenarios.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2510.13543

Genre: Research Report > New Finding (1.00)

Industry: